We consider the problem of adaptation to the margin in binary
classification. We suggest a penalized empirical risk minimization
classifier that adaptively attains, up to a logarithmic factor, fast
optimal rates of convergence for the excess risk, that is, rates that can
be faster than $n^{-1/2}$, where $n$ is the sample size. We show that our
method also gives adaptive estimators for the problem of edge
estimation.

1. Introduction. Consider observations $(X_1, Y_1), \dots, (X_n, Y_n)$, where
$Y_i$ is a bounded response random variable and $X_i \in \mathcal{X}$ is the
corresponding instance. We regard $\{(X_i, Y_i)\}_{i=1}^n$ as i.i.d. copies of
a population version $(X, Y)$. The goal is to predict the response $Y$ given
the value of the instance $X$. We consider two statistical problems: binary
classification and boundary estimation in binary images (edge estimation).
In the classification setup $Y_i \in \{0, 1\}$ is a label (e.g., {ill, healthy},
{white, black}, etc.), while in edge estimation $Y_i$ can be either a label or
a general bounded random variable. Most of the paper will be concerned with
the model of binary classification. The results for edge estimation are
quite analogous and they will be stated as corollaries in Section 6.

Any subset $G$ of the instance space $\mathcal{X}$ may be identified with its
indicator function $1_G$, that is, with a classification rule or classifier
$G$ which predicts $Y = 1$ iff $X \in G$. The prediction error $R(G)$ of the
classifier $G$ is the probability that it predicts the wrong label, that is,

(1.1) $R(G) = E([Y - 1_G(X)]^2).$

Let $\eta(X) = P(Y = 1 \mid X)$ be the regression of $Y$ on $X$. The Bayes rule
is the classifier

(1.2) $G^* = \{x \in \mathcal{X} : \eta(x) > 1/2\}.$

This rule is optimal in the sense that it minimizes the prediction error over
all $G \subset \mathcal{X}$ [see, e.g., Devroye, Györfi and Lugosi (1996)]. The
regression $\eta$ is generally unknown. We consider the construction of an
estimator $\hat G_n \subset \mathcal{X}$ of the Bayes rule $G^*$ without
directly estimating $\eta$.

The performance of a classifier $\hat G_n$ is measured by its excess risk
$E(R(\hat G_n)) - R(G^*)$. It is well known that for various classifiers the
excess risk converges to 0 as $n \to \infty$ at the rate $n^{-1/2}$ or slower
[see Devroye, Györfi and Lugosi (1996) and Vapnik (1998), where one can find
further references]. Moreover, under conditions on the identifiability of the
minimum of the risk $R(\cdot)$ called margin conditions, some classifiers can
attain fast rates, that is, rates that are faster than $n^{-1/2}$. The
existence of such fast rates in classification problems has been established
by Mammen and Tsybakov (1999). They showed that optimal rates of convergence
of the excess risk to 0 depend on two parameters: the complexity of the class
of candidate sets $G$ (parameter $\rho$) and the margin parameter $\kappa$,
which characterizes the extent of identifiability. Their construction was
nonadaptive, supposing that $\rho$ and $\kappa$ were known. Tsybakov (2004)
suggested an adaptive classifier that attains the fast optimal rates, up to a
logarithmic factor, without prior knowledge of the parameters $\rho$ and
$\kappa$, thus solving the so-called adaptation to the margin problem. The
classification rule suggested by Tsybakov (2004) is based on multiple
pretesting aggregation of empirical risk minimizers over a collection of
classes of candidate sets $G$. This procedure differs significantly from
penalized empirical risk classifiers that are widely used in the modern
practice of classification [cf. Schölkopf and Smola (2002)]. Subsequently
there has been a discussion in the literature of whether penalized classifiers
can adaptively attain fast optimal rates. In particular, Koltchinskii and
Panchenko (2002) and Audibert (2004) proposed convex combinations of
classifiers, and Koltchinskii (2001) and Lugosi and Wegkamp (2004) suggested
data-dependent penalties. The resulting adaptive classifiers converge with
rates that can be faster than $n^{-1/2}$ but that are different from the
optimal rates in a minimax sense considered in Tsybakov (2004).

This paper answers the above question affirmatively: penalized classifiers
can adaptively attain fast optimal rates. Moreover, the penalty allowing one
to achieve this effect is not data-dependent or randomized. It is very simple
and essentially arises from a sparsity argument similar to the one used in
the wavelet thresholding context. Interestingly, the penalty is not of the
$\ell_1$-type as for soft thresholding and not of the $\ell_0$-type as for
hard thresholding, but rather of an intermediate, block-wise $\ell_{1/2}$ or
"square root" type. Inspection of the proof shows that the effect is very
pointed, that is, the proof relies heavily on our particular choice of the
penalty.

The classifier $\hat G_n$ that we study is constructed as follows. Let

(1.3) $R_n(G) = \frac{1}{n} \sum_{i=1}^n (Y_i - 1_G(X_i))^2$

be the empirical risk of a classifier $G \subset \mathcal{X}$. Note that
$R_n(G)$ is the proportion of observations misclassified by $G$ and that its
expectation $R(G) = E(R_n(G))$ is the prediction error. Assume that
$X = (S, T) \in \mathcal{X} = [0, 1]^{d+1}$, with $S \in [0, 1]^d$
($d \le \log n$), and $T \in [0, 1]$. A boundary fragment is a subset $G$ of
$\mathcal{X}$ of the form

(1.4) $G = \{(s, t) \in \mathcal{X} : f(s) \ge t\},$

where $f$ is a function from $[0, 1]^d$ to $[0, 1]$ called the edge function.
We let $\hat G_n$ be a minimizer of the penalized empirical risk

(1.5) $R_n(G) + \mathrm{Pen}(G)$

over a large set of boundary fragments $G$. Here $\mathrm{Pen}(G)$ is a penalty
on the roughness of the boundary. The purpose of the penalty is to avoid
overfitting. We will show that a weighted square root penalty [see (2.2) and
(2.3)] results in a classifier with the adaptive properties discussed
above.
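
As an illustration of (1.3) and (1.4), the following minimal sketch computes the empirical risk of a boundary fragment for binary labels. The names `in_fragment`, `empirical_risk` and `flat_edge`, as well as the hand-made toy sample, are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# One observation: instance X = (s, t) with s in [0,1]^d and t in [0,1],
# together with a binary label y in {0, 1}.
Observation = Tuple[Sequence[float], float, int]
EdgeFunction = Callable[[Sequence[float]], float]


def in_fragment(f: EdgeFunction, s: Sequence[float], t: float) -> bool:
    """Membership in the boundary fragment G = {(s, t): f(s) >= t}, cf. (1.4)."""
    return f(s) >= t


def empirical_risk(f: EdgeFunction, sample: List[Observation]) -> float:
    """Empirical risk R_n(G) of (1.3): the proportion of misclassified points."""
    errors = 0
    for s, t, y in sample:
        predicted = 1 if in_fragment(f, s, t) else 0
        errors += (y - predicted) ** 2
    return errors / len(sample)


def flat_edge(s: Sequence[float]) -> float:
    """Hypothetical edge function: a horizontal boundary at height 1/2."""
    return 0.5


# Toy sample with d = 1 (values chosen by hand for illustration only).
toy_sample: List[Observation] = [
    ((0.1,), 0.2, 1),  # below the edge, label 1: correct
    ((0.4,), 0.9, 0),  # above the edge, label 0: correct
    ((0.7,), 0.3, 0),  # below the edge, label 0: misclassified
    ((0.9,), 0.6, 1),  # above the edge, label 1: misclassified
]
toy_risk = empirical_risk(flat_edge, toy_sample)
```

For labels in $\{0, 1\}$ the squared difference in (1.3) is 1 exactly when the prediction is wrong, which is why the sketch simply counts errors.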

A refinement as compared to Tsybakov (2004) is that we do not only con- 
sider adaptation in a minimax sense but also adaptation to the oracle. We ob- 
tain asymptotically exact oracle inequalities and then get minimax adapta- 
tion as a consequence. We work under somewhat different assumptions than 
in Tsybakov (2004). They are slightly more restrictive as concerns the model. 
For example, we consider only boundary fragments as candidates for G. The 
class of boundary fragments is possibly a genuine restriction, although some 
generalizations to other classes of sets are clearly feasible. On the other hand, 
our assumptions allow us to adapt to more general smoothness (complexity)
properties of $G$. For example, Vapnik-Chervonenkis classes of sets $G$
(corresponding approximately to $\rho = 0$, see Section 5) or the classes of
sets with very nonsmooth boundaries (corresponding to $\rho \ge 1$) are
covered by our approach.

As a corollary of the results, we obtain an adaptive estimator in the
problem of edge estimation considered by Korostelev and Tsybakov (1993). The
statistical model in that problem is similar to the one described above.
However, it treats the situation characteristic for image analysis where the
$X_i$'s are uniformly distributed on $\mathcal{X}$, and the error criterion is
not the excess risk but rather the risk
$E(\mu_{d+1}(\hat G_n \triangle G^*))$, where $\triangle$ is the symbol of
symmetric difference between sets and $\mu_{d+1}$ denotes the Lebesgue measure
on $[0, 1]^{d+1}$.

The paper is organized as follows. In Section 2 we define our adaptive
classifier. In Section 3 we introduce some notation and assumptions. Section 4
presents the main oracle inequality. In Section 5 we apply this inequality
to get minimax adaptation results. Section 6 discusses the consequences for
edge estimation. Proofs are given in Section 7.

2. Definition of the adaptive classifier. Let $\{\psi_k : k = 1, \dots, n\}$ be
an orthonormal system in $L_2([0, 1]^d, \mu_d)$, where $\mu_d$ is the Lebesgue
measure on $[0, 1]^d$. For $\alpha \in \mathbb{R}^n$ define

(2.1) $f_\alpha(s) = \sum_{k=1}^n \alpha_k \psi_k(s), \quad s \in [0, 1]^d.$

Introduce a double indexing for the system $\{\psi_k\}$, namely

$\{\psi_k : k = 1, \dots, n\} = \{\psi_{j,l} : j \in I_l,\ l = 1, \dots, L\},$

where $I_l$, $l = 1, \dots, L$, are disjoint subsets of $\{1, \dots, n\}$ such
that

$\sum_{l=1}^L |I_l| = n.$

Here $|A|$ denotes the cardinality of the set $A$. One may think of
$\{\psi_{j,l}\}$ as a wavelet-type system with the index $l$ corresponding to
a resolution level. A vector $\alpha \in \mathbb{R}^n$ can be written with this
double indexing as $\alpha = (\alpha_{j,l})$.

For a linear classification rule defined by the set
$G_\alpha = \{(s, t) \in \mathcal{X} : f_\alpha(s) \ge t\}$, consider the
penalty

(2.2) $\mathrm{Pen}(G_\alpha) = \lambda_n \sqrt{I(\alpha)},$

where $I(\cdot)$ is a nonsparsity measure of the form

(2.3) $I(\alpha) = \Bigl( \sum_{l=1}^L \sqrt{w_l \sum_{j \in I_l} |\alpha_{j,l}|} \Bigr)^2$

for certain weights $(w_l)$. In what follows we take the weights as

(2.4) $w_l = 2^{dl/2}, \quad l = 1, \dots, L,$

and we prove our results for wavelet-type bases (cf. Assumption B below).
An extension to other bases $\{\psi_k\}$ is possible where the block sizes
$|I_l|$ should be chosen in an appropriate way [e.g., as in Cavalier and
Tsybakov (2001)]. The weights $w_l$ should moreover be defined as a function
of $|I_l|$. We do not pursue this issue here because it requires different
techniques. Thus, in this paper we consider penalties based on

$\sqrt{I(\alpha)} = \sum_{l=1}^L 2^{dl/4} \sqrt{\sum_{j \in I_l} |\alpha_{j,l}|}.$

One may think of $\{\alpha_{j,l}\}$ as the coefficients of the expansion of a
function in the Besov space $B_{\sigma,p,q}([0, 1]^d)$, with $p = 1$,
$q = 1/2$ and smoothness $\sigma = (d + 1)d/2$ [so that the effective
smoothness is $s = \sigma/d = (d + 1)/2$]; see, for example, DeVore and
Lorentz (1993).
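
The penalty of (2.2)-(2.4) can be sketched in a few lines. The code below is only an illustration: it assumes the coefficients are stored as a dictionary mapping each level $l$ to the list of $\alpha_{j,l}$, $j \in I_l$, and the function names are hypothetical.

```python
import math
from typing import Dict, List

# Coefficients grouped by resolution level: level l -> [alpha_{j,l} for j in I_l].
Coefficients = Dict[int, List[float]]


def sqrt_nonsparsity(alpha: Coefficients, d: int) -> float:
    """sqrt(I(alpha)) = sum_l sqrt(w_l * sum_j |alpha_{j,l}|) with w_l = 2^(d*l/2)."""
    total = 0.0
    for level, block in alpha.items():
        weight = 2.0 ** (d * level / 2.0)       # w_l as in (2.4)
        block_l1 = sum(abs(a) for a in block)   # l_1 norm of the block I_l
        total += math.sqrt(weight * block_l1)
    return total


def nonsparsity(alpha: Coefficients, d: int) -> float:
    """I(alpha) as in (2.3)."""
    return sqrt_nonsparsity(alpha, d) ** 2


def square_root_penalty(alpha: Coefficients, d: int, lam: float) -> float:
    """Pen(G_alpha) = lambda_n * sqrt(I(alpha)) as in (2.2)."""
    return lam * sqrt_nonsparsity(alpha, d)


# Hand-made example with d = 1 and two levels (|I_1| = 2, |I_2| = 4).
alpha_example: Coefficients = {1: [0.5, 0.0], 2: [0.0, 0.25, 0.0, 0.0]}
penalty_example = square_root_penalty(alpha_example, d=1, lam=0.1)
```

Within a block the coefficients enter through their $\ell_1$ norm, while across blocks the square roots are summed; this is the "block-wise $\ell_{1/2}$" structure referred to in the text.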
We propose the estimator $\hat G_n = G_{\hat\alpha_n}$, where

(2.5) $\hat\alpha_n = \arg\min_{\alpha \in \mathbb{R}^n} \{ R_n(G_\alpha) + \lambda_n \sqrt{I(\alpha)} \}.$

Here $\lambda_n > 0$ is a regularization parameter that will be specified in
Theorem 1. We refer to $\lambda_n \sqrt{I(\alpha)}$ as a (block-wise)
$\ell_{1/2}$ or square root penalty.

One may compare (2.5) to a wavelet thresholding estimator for regression.
The difference here is that because our problem is nonlinear, we cannot
express the solution $\hat\alpha_n$ in a levelwise form, and we need to treat
all the coefficients $\alpha_{j,l}$ globally.
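
To make the global nature of the minimization in (2.5) concrete, here is a deliberately naive sketch that searches a finite grid of coefficient vectors instead of all of $\mathbb{R}^n$. Everything in it is an assumption made for illustration: $d = 1$, a toy orthonormal pair ($\psi_0 = 1$ and $\psi_1(s) = \sqrt{3}(2s - 1)$ on $[0, 1]$) placed in a single block at level 1, the grid, the sample and the value of the regularization parameter. It is not the authors' computational procedure.

```python
import math
from itertools import product
from typing import List, Tuple

Coeffs = Tuple[float, float]
Observation = Tuple[float, float, int]  # (s, t, y) with d = 1


def f_alpha(alpha: Coeffs, s: float) -> float:
    """f_alpha = alpha_0 * psi_0 + alpha_1 * psi_1 for the toy orthonormal pair."""
    psi_0 = 1.0
    psi_1 = math.sqrt(3.0) * (2.0 * s - 1.0)
    return alpha[0] * psi_0 + alpha[1] * psi_1


def empirical_risk(alpha: Coeffs, sample: List[Observation]) -> float:
    """Proportion of points misclassified by G_alpha = {(s, t): f_alpha(s) >= t}."""
    errors = sum((y - (1 if f_alpha(alpha, s) >= t else 0)) ** 2 for s, t, y in sample)
    return errors / len(sample)


def sqrt_nonsparsity(alpha: Coeffs, d: int = 1, level: int = 1) -> float:
    """sqrt(I(alpha)) when both coefficients sit in one block at the given level."""
    return math.sqrt(2.0 ** (d * level / 2.0) * sum(abs(a) for a in alpha))


def objective(alpha: Coeffs, sample: List[Observation], lam: float) -> float:
    """Penalized empirical risk R_n(G_alpha) + lam * sqrt(I(alpha)), cf. (2.5)."""
    return empirical_risk(alpha, sample) + lam * sqrt_nonsparsity(alpha)


def penalized_erm(candidates: List[Coeffs], sample: List[Observation], lam: float) -> Coeffs:
    """Brute-force minimizer of the objective over a finite candidate list."""
    return min(candidates, key=lambda a: objective(a, sample, lam))


grid = [i / 4 for i in range(-4, 5)]
candidates: List[Coeffs] = list(product(grid, grid))
toy_sample: List[Observation] = [
    (0.1, 0.05, 1), (0.3, 0.6, 0), (0.5, 0.4, 1),
    (0.7, 0.9, 0), (0.9, 0.7, 1), (0.8, 0.2, 1),
]
alpha_hat = penalized_erm(candidates, toy_sample, lam=0.05)
```

Because the empirical risk is piecewise constant in $\alpha$, the objective is neither convex nor separable across levels, which is one way to see why the coefficients have to be treated jointly.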

3. Notation and assumptions. Let $G \triangle G'$ be the symmetric difference
between two sets $G$ and $G'$, and let $Q$ denote the distribution of $X$. For
a Borel function $f : [0, 1]^d \to [0, 1]$, we let

(3.1) $\|f\|_1 = \int |f(s)| \, d\mu_d(s)$

be its $L_1$-norm. (Recall that $\mu_d$ denotes the Lebesgue measure on
$[0, 1]^d$.) Note that

(3.2) $\mu_{d+1}(G_\alpha \triangle G_{\alpha'}) = \|f_\alpha - f_{\alpha'}\|_1 = \|f_{\alpha - \alpha'}\|_1.$

Assumption A. For some (unknown) $\kappa \ge 1$ and $\sigma_0 > 0$ and for all
$\alpha \in \mathbb{R}^n$ we have

(3.3) $R(G_\alpha) - R(G^*) \ge \frac{1}{\sigma_0} Q^\kappa(G_\alpha \triangle G^*).$

Assumption A is a condition on sharpness of identifiability for the
minimum of the risk. We will call it the margin condition. We refer to
Tsybakov (2004) for a discussion of this condition. In particular, it is
related to the behavior of the probability $Q(|\eta(X) - 1/2| \le t)$ for
small $t$. The case $\kappa = 1$ corresponds to a jump of $\eta$ at the
boundary of $G^*$, and this is the most favorable case for estimation, while
$\kappa \to \infty$ corresponds to a "plateau" around the boundary, and this
is the least favorable case. For more discussion of the margin condition in
relation to convex aggregation of classifiers, such as boosting, see Bartlett,
Jordan and McAuliffe (2003) and Blanchard, Lugosi and Vayatis (2003).

We will also require the following condition on the basis. 

Assumption B. The system of functions
$\{\psi_{j,l},\ j \in I_l,\ l = 1, \dots, L\}$ is orthonormal in
$L_2([0, 1]^d, \mu_d)$ and satisfies, for some constant $c_\psi \ge 1$,

(3.4) $\|\psi_{j,l}\|_1 \le c_\psi 2^{-dl/2}, \quad j \in I_l,\ l = 1, \dots, L,$

(3.5) $\sup_{s \in [0,1]^d} \sum_{j \in I_l} |\psi_{j,l}(s)| \le c_\psi 2^{dl/2}, \quad l = 1, \dots, L,$

(3.6) $2^{dl}/c_\psi \le |I_l| \le c_\psi 2^{dl}, \quad l = 1, \dots, L,$

and

(3.7) $L \le c_\psi \frac{\log n}{d}.$

Assumption B makes it possible to relate $\|f_\alpha\|_1$ to $I(\alpha)$ in a
suitable way (cf. Lemmas 1 and 2). Note that Assumption B is quite standard.
It is satisfied, for instance, for usual bases of compactly supported wavelets
[cf. Härdle, Kerkyacharian, Picard and Tsybakov (1998), Chapter 7].

Note also that (3.7) follows from (3.6) with a different constant. To
simplify the exposition and calculations, we take the same constant in all
the conditions (3.4)-(3.7) and suppose that this constant is not smaller
than 1.

Remark 1. It will be clear from the proofs of Lemmas 1 and 2 that
Assumption B can be relaxed. Namely, the orthonormality of $\{\psi_{j,l}\}$
and (3.4) can be replaced by the conditions

$\sum_{j \in I_l} |\alpha_{j,l}| 2^{-dl/2} \le c_\psi \|f_\alpha\|_1, \quad l = 1, \dots, L,$

$\|f_\alpha\|_1 / c_\psi \le \sum_{l=1}^L \sum_{j \in I_l} |\alpha_{j,l}| 2^{-dl/2} \quad \forall \alpha \in \mathbb{R}^n.$

Finally we introduce an assumption which will allow us to interchange
Lebesgue measure and $Q$.

Assumption C. The distribution $Q$ of $X$ admits a density $q(\cdot)$ with
respect to Lebesgue measure in $[0, 1]^{d+1}$, and for some constant
$1 \le q_0 < \infty$ one has $1/q_0 \le q(x) \le q_0$ for all
$x \in [0, 1]^{d+1}$.



4. An oracle inequality. For $\alpha \in \mathbb{R}^n$ let

(4.1) $m(\alpha) = \min\{m : \alpha_{j,l} = 0 \text{ for all } j \in I_l \text{ with } l > m\}$

and

(4.2) $N(\alpha) = N_{m(\alpha)},$

with

(4.3) $N_m = \sum_{l=1}^m |I_l|, \quad m = 1, 2, \dots, L.$
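
The quantities in (4.1)-(4.3) are simple to compute once the coefficients are grouped by level. The sketch below assumes, for illustration only, a dictionary layout (level $l$ mapped to the list of $\alpha_{j,l}$) and hypothetical function names; returning 0 for the zero vector is a convention chosen here, not one stated in the text.

```python
from typing import Dict, List

# Coefficients grouped by resolution level: level l -> [alpha_{j,l} for j in I_l].
Coefficients = Dict[int, List[float]]


def resolution_level(alpha: Coefficients) -> int:
    """m(alpha) of (4.1): the largest level that carries a nonzero coefficient.

    Returns 0 for the zero vector (a boundary convention assumed in this sketch).
    """
    active = [level for level, block in alpha.items() if any(a != 0.0 for a in block)]
    return max(active, default=0)


def active_dimension(alpha: Coefficients) -> int:
    """N(alpha) = N_{m(alpha)}: total size of the blocks I_l with l <= m(alpha)."""
    m = resolution_level(alpha)
    return sum(len(block) for level, block in alpha.items() if level <= m)


# Example with block sizes 2, 4, 8; only level 2 carries a nonzero coefficient.
alpha_example: Coefficients = {1: [0.0, 0.0], 2: [0.1, 0.0, 0.0, 0.0], 3: [0.0] * 8}
m_example = resolution_level(alpha_example)   # level 3 is empty, level 2 is not
n_example = active_dimension(alpha_example)   # |I_1| + |I_2|
```

Note that $N(\alpha)$ counts all coefficients up to level $m(\alpha)$, including those that happen to be zero, so it measures the dimension of the approximating space rather than the number of nonzero entries.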
Assume that there exists $\alpha^{\mathrm{oracle}} \in \mathbb{R}^n$ such that

(4.4) $R(G_{\alpha^{\mathrm{oracle}}}) - R(G^*) + V_n(N(\alpha^{\mathrm{oracle}}))
= \min_{\alpha \in \mathbb{R}^n} \{ R(G_\alpha) - R(G^*) + V_n(N(\alpha)) \},$

where

(4.5) $V_n(N) = 4 c_\kappa (4 c_d q_0 c_\psi^2 \sigma_0^{1/\kappa} \lambda_n^2 N)^{\kappa/(2\kappa-1)}$

and where $c_\kappa = \frac{2\kappa - 1}{2\kappa} \kappa^{-1/(2\kappa-1)}$ and
$c_d = 2(2^d - 1)/(2^{d/2} - 1)^2$. Note that $V_n(N(\alpha))$ depends on the
regularization parameter $\lambda_n$, which we shall take of order
$\sqrt{\log^4 n / n}$ [see (4.6) in Theorem 1 below]. Then
$\alpha^{\mathrm{oracle}}$ can be interpreted as an oracle attaining nearly
ideal performance. In fact, the term $R(G_\alpha) - R(G^*)$ in (4.4) may be
viewed as an approximation error, while $V_n(N(\alpha))$ is related to the
stochastic error, as will be clear from the proofs. In other words, nearly
ideal performance is attained by the value $\alpha^{\mathrm{oracle}}$ that
trades off generalized bias and variance.
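
The bias-variance trade-off in (4.4) can be illustrated numerically. The sketch below assumes that (4.5) reads $V_n(N) = 4 c_\kappa (4 c_d q_0 c_\psi^2 \sigma_0^{1/\kappa} \lambda_n^2 N)^{\kappa/(2\kappa-1)}$ with $c_\kappa$ and $c_d$ as given in the text; the table of approximation errors, the parameter values and all function names are hypothetical.

```python
from typing import Dict


def c_kappa(kappa: float) -> float:
    """c_kappa = (2*kappa - 1) / (2*kappa) * kappa^(-1 / (2*kappa - 1))."""
    return (2.0 * kappa - 1.0) / (2.0 * kappa) * kappa ** (-1.0 / (2.0 * kappa - 1.0))


def c_dim(d: int) -> float:
    """c_d = 2 (2^d - 1) / (2^(d/2) - 1)^2."""
    return 2.0 * (2.0 ** d - 1.0) / (2.0 ** (d / 2.0) - 1.0) ** 2


def stochastic_term(N: int, lam: float, kappa: float, d: int,
                    q0: float, c_psi: float, sigma0: float) -> float:
    """V_n(N) under the reading of (4.5) stated in the lead-in (an assumption)."""
    inner = 4.0 * c_dim(d) * q0 * c_psi ** 2 * sigma0 ** (1.0 / kappa) * lam ** 2 * N
    return 4.0 * c_kappa(kappa) * inner ** (kappa / (2.0 * kappa - 1.0))


def oracle_dimension(approx_error: Dict[int, float], lam: float, kappa: float,
                     d: int, q0: float, c_psi: float, sigma0: float) -> int:
    """Dimension N minimizing approximation error + V_n(N) over a finite table."""
    return min(
        approx_error,
        key=lambda N: approx_error[N] + stochastic_term(N, lam, kappa, d, q0, c_psi, sigma0),
    )


# Hypothetical approximation errors R(G_alpha) - R(G*) for three dimensions N.
approx_error_table = {2: 0.30, 6: 0.10, 14: 0.05}
best_N = oracle_dimension(approx_error_table, lam=0.01, kappa=1.0, d=1,
                          q0=1.0, c_psi=1.0, sigma0=1.0)
```

With these made-up numbers the intermediate dimension wins: the smallest model has too large an approximation error, while the largest pays too much in the stochastic term.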

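Theorem 1. Suppose that Assumptions A-C are met. Then there exists
a universal constant $C$ such that for

(4.6) $\lambda_n = C \sqrt{\frac{q_0 c_\psi^2 \log^4 n}{n d}}$

and for any $\delta \in (0, 1]$ and $n \ge 8 q_0 c_\psi^2$ we have

(4.7) $P\Bigl( R(\hat G_n) - R(G^*) > (1 + \delta)^2 \inf_{\alpha \in \mathbb{R}^n}
\{ R(G_\alpha) - R(G^*) + \delta^{-1/(2\kappa-1)} V_n(N(\alpha)) \}
+ 2\lambda_n \sqrt{\frac{\log^4 n}{n}} \Bigr)
\le C \exp\Bigl( -\frac{c_\psi \log^4 n}{C^2 d} \Bigr).$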

Theorem 1 shows that, up to a constant factor and a small remainder
term $2\lambda_n \sqrt{\log^4 n / n}$, the estimator $\hat G_n$ mimics the
behavior of the oracle. If $\delta$ is chosen small enough or converging to 0,
for example, $\delta = 1/\log n$, the factor preceding the infimum in the
oracle inequality (4.7) approaches 1.

The regularization parameter $\lambda_n \asymp \sqrt{n^{-1} \log^4 n}$
appearing in Theorem 1 is larger than the choice $\sqrt{n^{-1} \log n}$ used
for wavelet thresholding in regression or density estimation. The value of
$\lambda_n$ is imposed by an inequality for the empirical process that
controls the stochastic error. Lemma 4 presents such an inequality, and the
additional $\log n$ factors are due to the result given there.

As a consequence of (4.7) and of the fact that $0 \le R(G) \le 1$ for all $G$,
we get the following inequality on the excess risk:

$E(R(\hat G_n)) - R(G^*) \le (1 + \delta)^2 \inf_{\alpha \in \mathbb{R}^n}
\{ R(G_\alpha) - R(G^*) + \delta^{-1/(2\kappa-1)} V_n(N(\alpha)) \}
+ 2\lambda_n \sqrt{\frac{\log^4 n}{n}}
+ C \exp\Bigl( -\frac{c_\psi \log^4 n}{C^2 d} \Bigr).$

This inequality bounds the excess risk by the oracle risk of a linear
classification rule $G_\alpha$ for any form of Bayes rule $G^*$. We emphasize
that $G^*$ is not necessarily a boundary fragment, and
$R(G_\alpha) - R(G^*)$ is not necessarily small. The results of this section
are thus of the learning theory type [cf. Devroye, Györfi and Lugosi (1996)
and Vapnik (1998)]. In the next section we will show that if $G^*$ is a
boundary fragment satisfying some regularity conditions, the excess risk
converges to zero at a fast rate.

5. Minimax adaptation. Here we will consider a minimax problem and 
we will show how the oracle inequality of Section 4 can be used to prove 
that our classifier adaptively attains fast optimal rates under smoothness 
assumptions on the edge function. 

Since in a minimax setup results should hold uniformly in the underlying
distribution, we first introduce some notation to express the dependence of
the margin behavior on the distribution of $(X, Y)$. Let us keep $d$ and also
$Q$ fixed. Then the joint distribution of $(X, Y)$ is determined by the
conditional probability $\eta(x)$ of the event $Y = 1$ given that $X = x$. Let
$\mathcal{H}$ be the class of all Borel functions $\eta$ on $\mathcal{X}$
satisfying $0 \le \eta \le 1$. For a given $\eta$ let $dP_\eta(x, y)$ be the
probability measure

$dP_\eta(x, y) = (y \eta(x) + (1 - y)(1 - \eta(x))) \, dQ(x), \quad (x, y) \in \mathcal{X} \times \{0, 1\}.$

Let $G^*_\eta$ be the Bayes rule when $(X, Y)$ has distribution $P_\eta$.
Finally, let $E_\eta$ denote expectation w.r.t. the distribution of
$\{(X_i, Y_i)\}_{i=1}^n$ under $P_\eta$. Now fix the numbers $\sigma_0 > 0$
and $\kappa \ge 1$ and define the collection of functions

(5.1) $\mathcal{H}_\kappa = \Bigl\{ \eta \in \mathcal{H} :
G^*_\eta = \{(s, t) \in \mathcal{X} : f^*_\eta(s) \ge t\},\
\frac{1}{\sigma_0} Q^\kappa(G_\alpha \triangle G^*_\eta) \le R(G_\alpha) - R(G^*_\eta)
\le \sigma_0 q_0^\kappa \|f_\alpha - f^*_\eta\|_\infty^\kappa
\text{ for all } \alpha \in \mathbb{R}^n \Bigr\},$

where $\|\cdot\|_\infty$ denotes the $L_\infty$-norm on $[0, 1]^d$ endowed with
Lebesgue measure, and $R(\cdot)$ depends on $\eta$ but in the notation we omit
this dependence for brevity. Note that we assume a lower as well as an upper
bound for the excess risk in definition (5.1), and in view of Assumption C and
(3.2), $Q^\kappa(G_\alpha \triangle G^*_\eta) \le q_0^\kappa \|f_\alpha - f^*_\eta\|_\infty^\kappa$.
This means that our assumption is less restrictive than requiring that the
lower bound be tight.

Let moreover $\rho > 0$ be a parameter characterizing the complexity of the
underlying set of boundary fragments and let $c_0$ be some constant. Denote
by $\mathcal{F}_\rho$ a class of functions $f : [0, 1]^d \to [0, 1]$ satisfying
the following condition: for every $f \in \mathcal{F}_\rho$ and every integer
$m \le L$ one has

(5.2) $\min_{\alpha :\, m(\alpha) \le m} \|f_\alpha - f\|_\infty \le c_0 2^{-dm/\rho}.$

This is true for various smoothness classes (Sobolev, Hölder and certain
Besov classes) with $1/\rho = \gamma/d$, where $\gamma$ is the regularity of
the boundary $f$ (e.g., the number of bounded derivatives of $f$), and various
bases $\{\psi_k\}$ [cf., e.g., Härdle, Kerkyacharian, Picard and Tsybakov
(1998), Corollary 8.2 and Theorem 9.6].

Denote by $\mathcal{G}_\rho$ a class of boundary fragments
$G = \{(s, t) \in \mathcal{X} : f(s) \ge t\}$ such that
$f \in \mathcal{F}_\rho$.

Theorem 2. Suppose that Assumptions B and C are met. Then

(5.3) $\sup_{\eta \in \mathcal{H}_\kappa :\, G^*_\eta \in \mathcal{G}_\rho}
[E_\eta(R(\hat G_n)) - R(G^*_\eta)]
= O\Bigl( \Bigl( \frac{\log^4 n}{n} \Bigr)^{\kappa/(2\kappa+\rho-1)} \Bigr)$

as $n \to \infty$.

Remark 2. For Hölder classes $\mathcal{F}_\rho$, the result of Theorem 2 is
optimal up to a logarithmic factor [cf. Mammen and Tsybakov (1999) and
Tsybakov (2004)]. Note that we cover here all values $\rho > 0$, thus
extending the adaptive result of Tsybakov (2004) to $\rho \ge 1$ (i.e., to
very irregular classes of boundaries). The case $\rho = 0$ can also be
introduced: it corresponds to the assumption that (5.2) holds with 0 in the
right-hand side. The class of functions $f$ thus defined is a
Vapnik-Chervonenkis class, and it is easy to see that the rate in Theorem 2
in this case becomes $(n^{-1} \log^4 n)^{\kappa/(2\kappa-1)}$.
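
A small numerical sketch may help to read the rate in Theorem 2. It assumes the exponent is $\kappa/(2\kappa + \rho - 1)$, which reduces to the exponent $\kappa/(2\kappa - 1)$ quoted above for $\rho = 0$; the function names are hypothetical.

```python
import math


def excess_risk_exponent(kappa: float, rho: float) -> float:
    """Exponent kappa / (2*kappa + rho - 1) of the rate in Theorem 2 (assumed form)."""
    return kappa / (2.0 * kappa + rho - 1.0)


def excess_risk_rate(n: int, kappa: float, rho: float) -> float:
    """(log^4 n / n) raised to the exponent above; constants are ignored."""
    return (math.log(n) ** 4 / n) ** excess_risk_exponent(kappa, rho)


def is_fast(kappa: float, rho: float) -> bool:
    """True when the polynomial part of the rate beats n^(-1/2)."""
    return excess_risk_exponent(kappa, rho) > 0.5


# A few parameter pairs: sharp margin with a simple class, and rougher boundaries.
exponents = {
    (1.0, 0.0): excess_risk_exponent(1.0, 0.0),
    (1.0, 1.0): excess_risk_exponent(1.0, 1.0),
    (2.0, 0.5): excess_risk_exponent(2.0, 0.5),
    (1.0, 2.0): excess_risk_exponent(1.0, 2.0),
}
```

Since $\kappa/(2\kappa + \rho - 1) > 1/2$ exactly when $\rho < 1$, the polynomial part of the rate is faster than $n^{-1/2}$ for $\rho < 1$ and slower for $\rho > 1$, whatever the value of $\kappa$.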

6. Edge estimation. In this section we consider the problem of estimation
of the edge function $f^*_\eta$ such that
$G^*_\eta = \{(s, t) \in \mathcal{X} : f^*_\eta(s) \ge t\}$, using the sample
$\{(X_i, Y_i)\}_{i=1}^n$. The risk for this problem is defined by
$E_\eta(\mu_{d+1}(\hat G_n \triangle G^*_\eta)) = E_\eta \|\hat f_n - f^*_\eta\|_1$,
where $\hat f_n = f_{\hat\alpha_n}$ is the estimator of $f^*_\eta$ obtained by
our method. Using the definition of $\mathcal{H}_\kappa$ we immediately get
the following corollary of Theorem 2.

Corollary 1. Suppose that Assumptions B and C are met. Then

(6.1) $\sup_{\eta \in \mathcal{H}_\kappa :\, G^*_\eta \in \mathcal{G}_\rho}
E_\eta \|\hat f_n - f^*_\eta\|_1
= O\Bigl( \Bigl( \frac{\log^4 n}{n} \Bigr)^{1/(2\kappa+\rho-1)} \Bigr)$

as $n \to \infty$.

Note that the setup of Corollary 1 is somewhat different from the standard
problem of edge estimation as defined by Korostelev and Tsybakov (1993).
In fact, it is in a sense more general because here the $X_i$'s are not
supposed to be uniformly distributed on $[0, 1]^{d+1}$ and the joint
distribution of $(X, Y)$ is not supposed to follow a specified regression
scheme. Also, the margin behavior is accounted for by the parameter $\kappa$.
On the other hand, Corollary 1 deals only with binary images,
$Y_i \in \{0, 1\}$, while Korostelev and Tsybakov (1993) allow
$Y_i \in \mathbb{R}$, for instance, the model

(6.2) $Y_i = 1_{G^0}(X_i) + \xi_i, \quad i = 1, \dots, n,$

where $\xi_i$ is a zero-mean random variable independent of $X_i$, and
consider the problem of estimation of the edge function $f^0$ assuming that
$G^0$ is a boundary fragment, $G^0 = \{(s, t) \in \mathcal{X} : f^0(s) \ge t\}$.

An important example covered by Corollary 1 is the model

(6.3) $Y_i = (1 + (2 \cdot 1_{G^0}(X_i) - 1) \xi_i)/2, \quad i = 1, \dots, n,$

where $\xi_i$ is a random variable independent of $X_i$ and taking values $-1$
and 1 with probabilities $1 - p$ and $p$, respectively, $1/2 < p < 1$. In this
model the observations $Y_i$ take values in $\{0, 1\}$ and they differ from
the original (nonnoisy) image values $Y_i' = 1_{G^0}(X_i)$ because each value
$Y_i'$ is switched (from 0 to 1 or vice versa) with probability $1 - p$ and
kept with probability $p$. This occurs, for example, if the image is
transmitted through a binary channel. The aim is to estimate the edge function
$f^0$ of the set $G^0$ assuming that $G^0$ is a boundary fragment.

It is easy to see that the regression function $\eta$ for the model (6.3)
equals $\eta(x) = p 1_{G^0}(x) + (1 - p)(1 - 1_{G^0}(x))$, which implies that
the set $G^0$ is identical to $G^*_\eta$ and thus $f^0 = f^*_\eta$. Also, it
is not hard to check that if the distribution of the $X_i$'s is uniform on
$[0, 1]^{d+1}$ we have that $\eta \in \mathcal{H}_1$, and Corollary 1 applies
with $\kappa = 1$.
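
The claim about the regression function of model (6.3) can be checked by enumerating the two values of the noise variable. The sketch below does this for a single pixel; the function names are hypothetical and `clean` stands for the nonnoisy value $1_{G^0}(x)$.

```python
def noisy_label(clean: int, xi: int) -> int:
    """Observed label Y = (1 + (2*clean - 1) * xi) / 2 of model (6.3), xi in {-1, 1}."""
    return (1 + (2 * clean - 1) * xi) // 2  # the numerator is always 0 or 2


def regression_by_enumeration(clean: int, p: float) -> float:
    """P(Y = 1 | clean value), summing over xi = 1 (prob. p) and xi = -1 (prob. 1 - p)."""
    return sum(prob for xi, prob in ((1, p), (-1, 1.0 - p)) if noisy_label(clean, xi) == 1)


def regression_closed_form(clean: int, p: float) -> float:
    """eta = p * 1_{G0} + (1 - p) * (1 - 1_{G0}), as stated in the text."""
    return p * clean + (1.0 - p) * (1 - clean)


def bayes_label(clean: int, p: float) -> int:
    """Bayes rule: predict 1 iff eta > 1/2."""
    return 1 if regression_closed_form(clean, p) > 0.5 else 0


# The label is kept when xi = 1 and flipped when xi = -1.
label_table = {(clean, xi): noisy_label(clean, xi) for clean in (0, 1) for xi in (-1, 1)}
```

Because $p > 1/2$, the regression function exceeds $1/2$ exactly on $G^0$, so the Bayes rule recovers the clean image; the jump of size $2p - 1$ at the boundary is what makes $\kappa = 1$ plausible here.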

Inspection of the proofs below shows that an analog of Corollary 1 also
holds for the model (6.2) if one assumes that the random variables $Y_i$ are
uniformly bounded. In this case only the constants in Lemma 4 and in the
definition of $\lambda_n$ should be changed and the set $G^*$ should be
indexed by the corresponding edge function $f$ rather than by the regression
$\eta$, other elements of the construction remaining intact. This extension is
quite obvious, and we do not pursue it here in more detail.

For $\kappa = 1$, Corollary 1 gives the rate $n^{-1/(\rho+1)}$, up to a
logarithmic factor. As shown by Korostelev and Tsybakov (1993), this rate is
optimal in a minimax sense when $\mathcal{F}_\rho$ is a Hölder class of
functions and the model is (6.2) or (6.3). Barron, Birgé and Massart (1999)
constructed adaptive estimators of the edge function in the model (6.2) with
$d = 1$, $\kappa = 1$, $\rho \ge \rho_0 > 0$, using a penalization with a
penalty that depends on the lower bound $\rho_0$ on $\rho$. They proved that
for this particular case the optimal rate $n^{-1/(\rho+1)}$ is attained by
their procedure. Corollary 1 extends these results, showing that our method
allows adaptation to any complexity $\rho > 0$ in any dimension $d \ge 1$ and
also adaptation to the margin $\kappa \ge 1$, which is necessary when we are
not sure that the boundary is sharp, that is, when the regression function
$\eta$ does not necessarily have a jump at the boundary. Assumption A or (5.1)
gives a convenient characterization of nonsharpness of the boundary, and our
penalized procedure allows us to adapt to the degree of nonsharpness.

7. Proofs. Before going into the technical details, let us first briefly
explain our choice of class of sets as boundary fragments, and the choice of
the penalty. When using boundary fragments, it is clear from (3.2) that the
approximation of sets boils down to approximation of functions in $L_1$. We
then use linear expansions, and need to relate the coefficients in these
expansions to the penalty. This is done in Lemmas 1 and 2. Lemma 1 bounds
the $L_1$-norm by $I(\cdot)$. Lemma 2 bounds $I(\cdot)$ by the $L_1$-norm when
the number of levels is limited by $m$. The (block-wise) $\ell_{1/2}$ penalty
ensures some important cancellations in the proof of Theorem 1. Its specific
structure is less important in Lemmas 3 and 4, with Lemma 4 being a rather
standard application of empirical process theory. Lemma 3 provides an upper
bound for the entropy with bracketing (see the definition preceding Lemma 3)
of the class of sets $G_\alpha \triangle G_{\alpha^*}$ with $\alpha$ varying,
$\alpha^*$ fixed, and $I(\alpha - \alpha^*) \le M$, $M > 0$. Lemma 4 is the
consequence of the entropy result of Lemma 3 for the empirical process.

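Lemma 1. Under Assumption B we have, for all $\alpha \in \mathbb{R}^n$,

(7.1) $\|f_\alpha\|_1 \le c_\psi I(\alpha).$

Proof. Using (3.4) we obtain

$\|f_\alpha\|_1 = \Bigl\| \sum_{j,l} \alpha_{j,l} \psi_{j,l} \Bigr\|_1
\le \sum_{j,l} |\alpha_{j,l}| \|\psi_{j,l}\|_1
\le c_\psi \sum_{j,l} |\alpha_{j,l}| 2^{-dl/2}
= c_\psi \sum_{l=1}^L 2^{-dl} \, 2^{dl/2} \sum_{j \in I_l} |\alpha_{j,l}|.$

But clearly, for all $l$,

$2^{dl/2} \sum_{j \in I_l} |\alpha_{j,l}| \le I(\alpha).$

Hence,

$\|f_\alpha\|_1 \le c_\psi \sum_{l=1}^L 2^{-dl} I(\alpha) \le c_\psi I(\alpha).$ □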

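Lemma 2. Let $\alpha \in \mathbb{R}^n$ and let $N(\alpha)$ be defined in (4.2).
Then under Assumption B

(7.2) $I(\alpha) \le c_d c_\psi^2 N(\alpha) \|f_\alpha\|_1.$

Proof. The coefficient $\alpha_{j,l}$ is the inner product

$\alpha_{j,l} = \int f_\alpha \psi_{j,l} \, d\mu_d,$

so by (3.5),

$\sum_{j \in I_l} |\alpha_{j,l}| \le \int |f_\alpha| \sum_{j \in I_l} |\psi_{j,l}| \, d\mu_d
\le c_\psi 2^{dl/2} \|f_\alpha\|_1.$

This implies that for $m = m(\alpha)$, with $m(\alpha)$ given in (4.1),

$\sqrt{I(\alpha)} = \sum_{l=1}^m 2^{dl/4} \sqrt{\sum_{j \in I_l} |\alpha_{j,l}|}
\le \sum_{l=1}^m 2^{dl/2} \sqrt{c_\psi \|f_\alpha\|_1}
\le \frac{2^{(m+1)d/2}}{2^{d/2} - 1} \sqrt{c_\psi \|f_\alpha\|_1}.$

Next, by (3.6) and the definition (4.2) of $N(\alpha)$,

$N(\alpha) = \sum_{l=1}^m |I_l| \ge \frac{1}{c_\psi} \sum_{l=1}^m 2^{dl}
\ge \frac{2^{(m+1)d}}{2 c_\psi (2^d - 1)}.$

Combining these inequalities we get the result. □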
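
Lemmas 1 and 2 can be sanity-checked numerically on a concrete system. The sketch below assumes the lemmas read $\|f_\alpha\|_1 \le c_\psi I(\alpha)$ and $I(\alpha) \le c_d c_\psi^2 N(\alpha) \|f_\alpha\|_1$, and uses Haar wavelets on $[0, 1]$ at levels $l = 1, \dots, 4$ (so $d = 1$, $|I_l| = 2^l$, and $c_\psi = 1$ in (3.4)-(3.6)). The choice of basis, the random coefficients and all names are illustrative assumptions; the check is a plausibility test, not a proof.

```python
import math
import random
from typing import Dict, List

# Coefficients grouped by level: level l -> [alpha_{j,l} for j = 0, ..., 2^l - 1].
Coefficients = Dict[int, List[float]]


def haar(level: int, j: int, s: float) -> float:
    """Haar wavelet psi_{j,l}(s) = 2^(l/2) h(2^l s - j), h = +1 on [0,1/2), -1 on [1/2,1)."""
    u = (2 ** level) * s - j
    if 0.0 <= u < 0.5:
        return 2.0 ** (level / 2.0)
    if 0.5 <= u < 1.0:
        return -(2.0 ** (level / 2.0))
    return 0.0


def f_alpha(alpha: Coefficients, s: float) -> float:
    """Linear expansion f_alpha(s) = sum over (j, l) of alpha_{j,l} psi_{j,l}(s)."""
    return sum(a * haar(level, j, s) for level, block in alpha.items() for j, a in enumerate(block))


def l1_norm(alpha: Coefficients) -> float:
    """Exact L_1 norm: f_alpha is constant on dyadic cells of width 2^-(L+1)."""
    cells = 2 ** (max(alpha) + 1)
    return sum(abs(f_alpha(alpha, (k + 0.5) / cells)) for k in range(cells)) / cells


def nonsparsity(alpha: Coefficients, d: int = 1) -> float:
    """I(alpha) with weights w_l = 2^(d*l/2), cf. (2.3)-(2.4)."""
    root = sum(
        math.sqrt(2.0 ** (d * level / 2.0) * sum(abs(a) for a in block))
        for level, block in alpha.items()
    )
    return root ** 2


rng = random.Random(0)
alpha = {level: [rng.uniform(-1.0, 1.0) for _ in range(2 ** level)] for level in range(1, 5)}

c_psi = 1.0
c_d = 2.0 * (2.0 - 1.0) / (math.sqrt(2.0) - 1.0) ** 2   # c_d for d = 1
N_alpha = sum(len(block) for block in alpha.values())   # all four levels are active

lemma1_holds = l1_norm(alpha) <= c_psi * nonsparsity(alpha)
lemma2_holds = nonsparsity(alpha) <= c_d * c_psi ** 2 * N_alpha * l1_norm(alpha)
```

The two inequalities together say that, for a fixed number of levels, $I(\alpha)$ and $\|f_\alpha\|_1$ control each other up to the factor $N(\alpha)$, which is what allows the penalty to be compared with the $L_1$ distance between edge functions.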

Definition 1. Let $\mathcal{Z} \subset L_p(S, \nu)$ be a collection of
functions on some measurable space $(S, \nu)$, $1 \le p < \infty$. For each
$\delta > 0$, the $\delta$-covering number with bracketing
$N_{B,p}(\delta, \mathcal{Z}, \nu)$ of $\mathcal{Z}$ is the smallest value of
$N$ such that there exists a collection of pairs of functions
$\{[z_j^L, z_j^U]\}_{j=1}^N$ that satisfies:

• $z_j^L \le z_j^U$ and $\|z_j^U - z_j^L\|_p \le \delta$ for all
$j \in \{1, \dots, N\}$ [with $\|\cdot\|_p$ being the $L_p(S, \nu)$-norm],

• for each $z \in \mathcal{Z}$ there is a $j \in \{1, \dots, N\}$ such that
$z_j^L \le z \le z_j^U$.

The $\delta$-entropy with bracketing of $\mathcal{Z}$ is
$H_{B,p}(\delta, \mathcal{Z}, \nu) = \log N_{B,p}(\delta, \mathcal{Z}, \nu)$.

Definition 2. Let $\mathcal{Z}$ be a collection of bounded functions on $S$.
The $\delta$-covering number for the sup-norm, $N_\infty(\delta, \mathcal{Z})$,
is the smallest number $N$ such that there are functions $\{z_j\}_{j=1}^N$
with, for each $z \in \mathcal{Z}$,

$\min_{j=1,\dots,N} \sup_{s \in S} |z(s) - z_j(s)| \le \delta.$

The $\delta$-entropy for the sup-norm is
$H_\infty(\delta, \mathcal{Z}) = \log N_\infty(\delta, \mathcal{Z})$.

Note that when $\nu$ is a probability measure [cf. van de Geer (2000), page 17],

(7.3) $H_{B,p}(\delta, \mathcal{Z}, \nu) \le H_\infty(\delta/2, \mathcal{Z}), \quad \delta > 0.$

For a class $\mathcal{G}$ of subsets of $(\mathcal{X}, Q)$, we write
$H_B(\delta, \mathcal{G}, Q) = H_{B,1}(\delta, \{1_G : G \in \mathcal{G}\}, Q)$.

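Lemma 3. Let $\alpha^* \in \mathbb{R}^n$ be fixed. For $0 < M \le n$ define
$\mathcal{G}_M = \{G = G_\alpha \triangle G_{\alpha^*} : \alpha \in \mathbb{R}^n,\ I(\alpha - \alpha^*) \le M\}$.
Suppose that Assumptions B and C are met. Then

(7.4) $H_B(\delta, \mathcal{G}_M, Q) \le \frac{M}{\delta}
\Bigl( \frac{8 q_0 c_\psi^2 \log n}{d} \Bigr)
\log\Bigl( \frac{8 q_0 c_\psi^2 n}{\delta d} \Bigr)$

for all $0 < \delta \le 1$.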

Proof. Define $\mathcal{F}_M = \{f_\alpha : \alpha \in \mathbb{R}^n,\ I(\alpha) \le M\}$.
In view of Assumption C,

(7.5) $H_B(q_0 \delta, \mathcal{G}_M, Q) \le H_{B,1}(\delta, \mathcal{F}_M, \mu_d), \quad \delta > 0.$

This and (7.3) show that it is sufficient to bound
$H_\infty(\cdot, \mathcal{F}_M)$.

Fix some $\delta > 0$. Our aim is now to bound the quantity
$H_\infty((c_\psi^2 d^{-1} \log n)\delta, \mathcal{F}_M)$. To do this, note that
one can construct a $(c_\psi^2 d^{-1} \log n)\delta$-net on $\mathcal{F}_M$ for
the sup-norm in the following way. The elements of the net are $f_{\alpha'}$
where $\alpha'_{j,l}$ takes discretized values with step $\delta 2^{-dl/2}$.
For every $\alpha_{j,l}$ define $\alpha'_{j,l}$ as the element closest to
$\alpha_{j,l}$ of the $\delta 2^{-dl/2}$-net on the interval

$[-M 2^{-dl/2}, M 2^{-dl/2}].$

Note that this interval contains all admissible values of $\alpha_{j,l}$ since
$|\alpha_{j,l}| \le M 2^{-dl/2}$, $\forall j, l$, for all $\alpha$ such that
$I(\alpha) \le M$. With this definition of $\alpha'_{j,l}$ we


14 A. B. TSYBAKOV AND S. A. VAN DE GEER 

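have $|\alpha_{j,l} - \alpha'_{j,l}| \le \delta 2^{-dl/2}$, and thus

$\sup_{s \in [0,1]^d} |f_\alpha(s) - f_{\alpha'}(s)|
\le \sum_{l=1}^L \sup_{s \in [0,1]^d} \sum_{j \in I_l} |\alpha_{j,l} - \alpha'_{j,l}| \, |\psi_{j,l}(s)|
\le \delta \sum_{l=1}^L 2^{-dl/2} \sup_{s \in [0,1]^d} \sum_{j \in I_l} |\psi_{j,l}(s)|
\le L c_\psi \delta \le (c_\psi^2 d^{-1} \log n)\delta,$

where we have used Assumption B for the last two inequalities. Thus we
have proved that the above construction gives in fact a
$(c_\psi^2 d^{-1} \log n)\delta$-net on $\mathcal{F}_M$ for the sup-norm.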

Let us now evaluate the cardinality of this net. This will be based on the
following three observations.

Observation 1. For every $\alpha$ such that $I(\alpha) \le M$ there exist at
most $M/\delta$ indices $k = (j, l)$ such that
$|\alpha_{j,l}| > \delta 2^{-dl/2}$. To show this, define

$N_l(\alpha) = |\{j \in I_l : |\alpha_{j,l}| > \delta 2^{-dl/2}\}|, \quad l = 1, \dots, L.$

Then

$\sqrt{M} \ge \sqrt{I(\alpha)}
\ge \sum_{l=1}^L 2^{dl/4} \sqrt{\sum_{j \in I_l :\, |\alpha_{j,l}| > \delta 2^{-dl/2}} |\alpha_{j,l}|}
\ge \sqrt{\delta} \sum_{l=1}^L \sqrt{N_l(\alpha)}.$

Hence

$\sum_{l=1}^L \sqrt{N_l(\alpha)} \le \sqrt{\frac{M}{\delta}},$

and so

$\sum_{l=1}^L N_l(\alpha) \le \frac{M}{\delta}.$

Observation 2. For each $j$ and $l$, we can approximate the interval
$\{|\alpha_{j,l}| \le M 2^{-dl/2}\}$ by a set of cardinality at most

$\frac{2M}{\delta} + 1$

such that each coefficient $\alpha_{j,l}$ is approximated to within the
distance $\delta 2^{-dl/2}$.

Observation 3. The number of different ways to choose at most $M/\delta$
nonzero coefficients out of $n$ is

$\sum_{0 \le N \le \min\{M/\delta, n\}} \binom{n}{N} \le (n + 1)^{M/\delta}$

[see, e.g., Devroye, Györfi and Lugosi (1996), page 218].

It follows from Observation 3 that there exist at most $(n + 1)^{M/\delta}$
possibilities to choose the sets of nonzero coordinates of the vectors
$\alpha'$ belonging to the net. For each of these possibilities the
discretization is performed on each of the nonzero coordinates, which gives
at most

$\Bigl( \frac{2M}{\delta} + 1 \Bigr)^{M/\delta}$

new possibilities in view of Observations 1 and 2. Thus, the cardinality of
the considered $(c_\psi^2 d^{-1} \log n)\delta$-net on $\mathcal{F}_M$ is
bounded by

$(n + 1)^{M/\delta} \Bigl( \frac{2M}{\delta} + 1 \Bigr)^{M/\delta},$

which implies

(7.6) $H_\infty((c_\psi^2 d^{-1} \log n)\delta, \mathcal{F}_M)
\le \frac{M}{\delta} \Bigl( \log\Bigl( \frac{2M}{\delta} + 1 \Bigr) + \log(n + 1) \Bigr).$
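
The counting behind (7.6) can be reproduced with exact integer arithmetic. The sketch below assumes the combinatorial bound of Observation 3 in the form $\sum_{N \le k} \binom{n}{N} \le (n + 1)^k$ and the net cardinality $(n + 1)^{M/\delta} (2M/\delta + 1)^{M/\delta}$; the function names and the numerical values are illustrative assumptions.

```python
import math


def sparsity_patterns(n: int, k: int) -> int:
    """Number of ways to choose at most k nonzero coordinates out of n."""
    return sum(math.comb(n, size) for size in range(0, min(k, n) + 1))


def sup_norm_entropy_bound(M: float, delta: float, n: int) -> float:
    """Right-hand side of (7.6): (M/delta) * (log(2M/delta + 1) + log(n + 1))."""
    return (M / delta) * (math.log(2.0 * M / delta + 1.0) + math.log(n + 1.0))


# Example: M = 3, delta = 0.5, so at most M/delta = 6 coefficients are nonzero.
M_example, delta_example, n_example = 3.0, 0.5, 100
k_example = int(M_example / delta_example)
values_per_coordinate = int(2 * M_example / delta_example) + 1   # Observation 2
net_cardinality_bound = (n_example + 1) ** k_example * values_per_coordinate ** k_example
log_cardinality = math.log(net_cardinality_bound)
entropy_bound = sup_norm_entropy_bound(M_example, delta_example, n_example)
```

Taking logarithms of the cardinality bound gives exactly the right-hand side of (7.6), which grows only logarithmically in $n$ for fixed $M/\delta$; this is where the sparsity forced by the square root penalty pays off.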
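In view of (7.5) this yields

$H_B(\delta, \mathcal{G}_M, Q)
\le \frac{M}{\delta} \Bigl( \frac{2 q_0 c_\psi^2 \log n}{d} \Bigr)
\Bigl[ \log\Bigl( \frac{4 q_0 c_\psi^2 M \log n}{\delta d} + 1 \Bigr) + \log(n + 1) \Bigr]
\le \frac{M}{\delta} \Bigl( \frac{2 q_0 c_\psi^2 \log n}{d} \Bigr)
\Bigl[ \log\Bigl( \frac{4 q_0 c_\psi^2 n^2}{\delta d} + 1 \Bigr) + \log(n + 1) \Bigr],$

since $M \log n \le n \log n \le n^2$. Continuing with this bound, we arrive at

$H_B(\delta, \mathcal{G}_M, Q)
\le \frac{M}{\delta} \Bigl( \frac{4 q_0 c_\psi^2 \log n}{d} \Bigr)
\log\Bigl( \frac{4 q_0 c_\psi^2 n^2}{\delta d} + 1 \Bigr)
\le \frac{M}{\delta} \Bigl( \frac{4 q_0 c_\psi^2 \log n}{d} \Bigr)
\log\Bigl( \frac{8 q_0 c_\psi^2 n^2}{\delta d} \Bigr)
\le \frac{M}{\delta} \Bigl( \frac{8 q_0 c_\psi^2 \log n}{d} \Bigr)
\log\Bigl( \frac{8 q_0 c_\psi^2 n}{\delta d} \Bigr).$ □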



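Now we turn to the empirical process

(7.7) $\nu_n(\alpha) = \sqrt{n} (R_n(G_\alpha) - R(G_\alpha)), \quad \alpha \in \mathbb{R}^n.$

Lemma 4. Let Assumptions B and C hold. Then there exists a universal
constant $C$ such that for $n \ge 8 q_0 c_\psi^2$ we have, for all
$\alpha^* \in \mathbb{R}^n$,

(7.8) $P\Bigl( \sup_{\alpha \in \mathbb{R}^n}
\frac{|\nu_n(\alpha) - \nu_n(\alpha^*)|}{\sqrt{I(\alpha - \alpha^*)} + \sqrt{\log^4 n / n}}
\ge C \sqrt{\frac{q_0 c_\psi^2 \log^4 n}{d}} \Bigr)
\le C \exp\Bigl( -\frac{c_\psi \log^4 n}{C^2 d} \Bigr).$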



Proof. We will apply Theorem 5.11 in van de Geer (2000) which,
translated to our situation, says the following. Let

$h_\alpha(X, Y) = (Y - 1_{G_\alpha}(X))^2 - (Y - 1_{G_{\alpha^*}}(X))^2$

and

$\mathcal{H}_M = \{h_\alpha : I(\alpha - \alpha^*) \le M\}.$

Also, let $R^2 \le 1$ satisfy

$\sup_{h \in \mathcal{H}_M} \int h^2 \, dP \le R^2,$

where $P$ is the law of $(X, Y)$. Then Theorem 5.11 in van de Geer (2000) gives
that for some universal constant $C_0$, and for all $a$ satisfying both
$a \le \sqrt{n} R^2$ and

$a \ge C_0 \Bigl( \int_{a/(C_0 \sqrt{n})}^{R} H_{B,2}^{1/2}(u, \mathcal{H}_M, P) \, du \vee R \Bigr),$

one has

(7.9) $P\Bigl( \sup_{\alpha \in \mathbb{R}^n :\, I(\alpha - \alpha^*) \le M}
|\nu_n(\alpha) - \nu_n(\alpha^*)| \ge a \Bigr)
\le C_0 \exp\Bigl( -\frac{a^2}{C_0^2 R^2} \Bigr).$

To apply this result, note first that

(7.10) $|(Y - 1_{G_\alpha}(X))^2 - (Y - 1_{G_{\alpha^*}}(X))^2| = |1_{G_\alpha}(X) - 1_{G_{\alpha^*}}(X)|.$

We therefore get

$\sup_{h \in \mathcal{H}_M} \int h^2 \, dP = \sup_{G \in \mathcal{G}_M} Q(G),$

where $\mathcal{G}_M$ is defined as in Lemma 3. Hence by Lemma 1, Assumption C
and (3.2) we may take

$R^2 = (q_0 c_\psi M) \wedge 1.$

Moreover, again by (7.10),

$H_{B,2}(\delta, \mathcal{H}_M, P) = H_{B,2}(\delta, \{1_G : G \in \mathcal{G}_M\}, Q)
= H_B(\delta^2, \mathcal{G}_M, Q), \quad \delta > 0.$

Using Lemma 3, for any $a \le \sqrt{n} R^2$, $\log^4 n / n \le M \le n$ and
$n \ge 8 q_0 c_\psi^2$, we get the bound

$\int_{a/(C_0 \sqrt{n})}^{1} H_{B,2}^{1/2}(u, \mathcal{H}_M, P) \, du
\le c' \sqrt{\frac{q_0 c_\psi^2 M \log n}{d}} \, \log^{3/2} n,$

where $c'$ is a universal constant. We therefore can take

$a = c \sqrt{\frac{q_0 c_\psi^2 M \log^4 n}{d}}$

with an appropriate universal constant $c$. Insert this value for $a$ and the
value of $R$ in (7.9) to find that for $\log^4 n / n \le M \le n$, and
trivially also for $M > n$,

(7.11) $P\Bigl( \sup_{\alpha \in \mathbb{R}^n :\, I(\alpha - \alpha^*) \le M}
|\nu_n(\alpha) - \nu_n(\alpha^*)| \ge c \sqrt{\frac{q_0 c_\psi^2 M \log^4 n}{d}} \Bigr)
\le C_0 \exp\Bigl( -\frac{c_\psi \log^4 n \, (M \vee 1)}{C_0^2 d} \Bigr).$



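The result now follows from the peeling device as, for example, explained
in Section 5.3 of van de Geer (2000). The argument is as follows.
We have

$P\Bigl( \sup_{\alpha \in \mathbb{R}^n}
\frac{|\nu_n(\alpha) - \nu_n(\alpha^*)|}{\sqrt{I(\alpha - \alpha^*)} + \sqrt{\log^4 n / n}}
\ge C \sqrt{\frac{q_0 c_\psi^2 \log^4 n}{d}} \Bigr)$

$\quad \le P\Bigl( \sup_{I(\alpha - \alpha^*) \le 1}
\frac{|\nu_n(\alpha) - \nu_n(\alpha^*)|}{\sqrt{I(\alpha - \alpha^*)} + \sqrt{\log^4 n / n}}
\ge C \sqrt{\frac{q_0 c_\psi^2 \log^4 n}{d}} \Bigr)$

$\qquad + P\Bigl( \sup_{I(\alpha - \alpha^*) > 1}
\frac{|\nu_n(\alpha) - \nu_n(\alpha^*)|}{\sqrt{I(\alpha - \alpha^*)} + \sqrt{\log^4 n / n}}
\ge C \sqrt{\frac{q_0 c_\psi^2 \log^4 n}{d}} \Bigr)$

$\quad = P_I + P_{II}.$

Furthermore, for $j_0$ the integer such that
$2^{-j_0} \le \log^4 n / n < 2^{-j_0+1}$, we find

$P_I \le \sum_{j=0}^{j_0} P\Bigl( \sup_{I(\alpha - \alpha^*) \le 2^{-j}}
|\nu_n(\alpha) - \nu_n(\alpha^*)| \ge \frac{C}{2} \sqrt{\frac{q_0 c_\psi^2 2^{-j} \log^4 n}{d}} \Bigr)
= \sum_{j=0}^{j_0} P_{I,j}.$

Similarly,

$P_{II} \le \sum_{j=1}^{\infty} P\Bigl( \sup_{I(\alpha - \alpha^*) \le 2^{j}}
|\nu_n(\alpha) - \nu_n(\alpha^*)| \ge \frac{C}{2} \sqrt{\frac{q_0 c_\psi^2 2^{j} \log^4 n}{d}} \Bigr)
= \sum_{j=1}^{\infty} P_{II,j}.$

The lemma then follows by choosing $C$ appropriately and applying (7.11)
to each of the $P_{I,j}$, $j = 0, \dots, j_0$, and $P_{II,j}$, $j = 1, 2, \dots$ □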

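Lemma 5. For any positive $v$, $t$ and any $\kappa \ge 1$, $\delta > 0$ we have
$$v\,t^{1/(2\kappa)} \le \frac{\delta}{2}\,t + c_\kappa\,\delta^{-1/(2\kappa-1)}\,v^{2\kappa/(2\kappa-1)},$$
where $c_\kappa = \frac{2\kappa-1}{2\kappa}\,\kappa^{-1/(2\kappa-1)}$.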

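Proof. By the concavity of the log-function, we have for positive $a$, $b$,
$x$ and $y$, with $1/x + 1/y = 1$,
$$\log(ab) = \frac{1}{x}\log(a^x) + \frac{1}{y}\log(b^y) \le \log\Bigl(\frac{1}{x}a^x + \frac{1}{y}b^y\Bigr),$$
or
$$ab \le \frac{1}{x}a^x + \frac{1}{y}b^y.$$
The lemma is obtained when we choose
$$a = v(\kappa\delta)^{-1/(2\kappa)},\qquad b = (\kappa\delta t)^{1/(2\kappa)},\qquad x = \frac{2\kappa}{2\kappa-1},\qquad y = 2\kappa.$$
□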
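
The following Python sketch checks numerically the inequality $ab \le a^x/x + b^y/y$ and the bound it yields for the choices of $a$, $b$, $x$, $y$ made in the proof, namely $v\,t^{1/(2\kappa)} \le (\delta/2)\,t + c_\kappa\,\delta^{-1/(2\kappa-1)}\,v^{2\kappa/(2\kappa-1)}$. This form of the bound is an assumption derived from those choices, and the function names and sampling ranges are illustrative only. A random search is of course not a proof.

```python
import random

def c_kappa(kappa):
    """Constant c_kappa = (2k - 1)/(2k) * k**(-1/(2k - 1))."""
    return (2 * kappa - 1) / (2 * kappa) * kappa ** (-1.0 / (2 * kappa - 1))

def young_gap(a, b, x):
    """Right side minus left side of ab <= a**x/x + b**y/y, with 1/x + 1/y = 1."""
    y = x / (x - 1.0)
    return a ** x / x + b ** y / y - a * b

def lemma5_gap(v, t, kappa, delta):
    """Right side minus left side of the assumed Lemma 5 bound."""
    lhs = v * t ** (1.0 / (2 * kappa))
    rhs = (0.5 * delta * t
           + c_kappa(kappa) * delta ** (-1.0 / (2 * kappa - 1))
           * v ** (2 * kappa / (2 * kappa - 1)))
    return rhs - lhs

def smallest_gaps(trials=10000, seed=0):
    """Smallest observed gaps over random (v, t, kappa, delta)."""
    rng = random.Random(seed)
    worst_young = worst_lemma = float("inf")
    for _ in range(trials):
        v, t = rng.uniform(0.01, 10), rng.uniform(0.01, 10)
        kappa, delta = rng.uniform(1, 5), rng.uniform(0.01, 1)
        # The choices of a, b, x from the proof of Lemma 5.
        a = v * (kappa * delta) ** (-1.0 / (2 * kappa))
        b = (kappa * delta * t) ** (1.0 / (2 * kappa))
        x = 2 * kappa / (2 * kappa - 1)
        worst_young = min(worst_young, young_gap(a, b, x))
        worst_lemma = min(worst_lemma, lemma5_gap(v, t, kappa, delta))
    return worst_young, worst_lemma

print(smallest_gaps())  # both entries should be nonnegative up to rounding
```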

We now come to the proof of the main theorem. This proof follows the 
lines of Loubes and van de Geer (2002) [see also van de Geer (2003)]. 

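PROOF of Theorem 1. Fix an arbitrary $\alpha^* \in \mathbb{R}^n$. (We stress here that
$\alpha^*$ is just a notation and need not be related in any sense to the Bayes
rule $G^*$.) Let $\Xi$ be the random event
$$\Xi = \Bigl\{ |\nu_n(\hat\alpha_n)-\nu_n(\alpha^*)|/\sqrt{n} \le \lambda_n\sqrt{I(\hat\alpha_n-\alpha^*)} + \lambda_n\sqrt{\frac{\log^4 n}{n}} \Bigr\}. \tag{7.12}$$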

By Lemma 4, for n sufficiently large, 

n c^log 4 n 
P(s) > 1 — C exp 



log n 



n 



C 2 d 

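So we only need to consider what happens on the set $\Xi$. The definition of $\hat\alpha_n$
implies
$$R_n(G_{\hat\alpha_n}) + \lambda_n\sqrt{I(\hat\alpha_n)} \le R_n(G_{\alpha^*}) + \lambda_n\sqrt{I(\alpha^*)},$$
which may be rewritten in the form
$$R(G_{\hat\alpha_n}) \le -[\nu_n(\hat\alpha_n)-\nu_n(\alpha^*)]/\sqrt{n} - \lambda_n\bigl[\sqrt{I(\hat\alpha_n)}-\sqrt{I(\alpha^*)}\bigr] + R(G_{\alpha^*}). \tag{7.13}$$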
Hence on $\Xi$ we get
$$R(G_{\hat\alpha_n}) \le \lambda_n\sqrt{I(\hat\alpha_n-\alpha^*)} - \lambda_n\bigl[\sqrt{I(\hat\alpha_n)}-\sqrt{I(\alpha^*)}\bigr] + R(G_{\alpha^*}) + \lambda_n\sqrt{\frac{\log^4 n}{n}}.$$
Let m* = m(a*), and let, for any a, 



/«(a)=E2* /4 jEl^l> 



L 



m(a)= e 2*/* El^l- 

i= m *+i y je/ ( 
Since $I^{(2)}(\alpha-\alpha^*) = I^{(2)}(\alpha)$, we now find

fl(G& n ) < \ n ^Jim(a n -a*) + X n y/lW(& n ) - X n [^WfaJ - y/lW(a*)] 



X n Jm(a n ) + R(G^) + X n ) 



log 4 n 



n 



A n) /jW(a„ - a*) - X n [^JlW(a n ) - JlW(a*)] 
+ R(G a * ) + 



log n 



n 

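Since for any $a, b \in \mathbb{R}$, $\sqrt{|a|}-\sqrt{|b|} \le \sqrt{|a-b|}$, we arrive at
$$R(G_{\hat\alpha_n}) \le 2\lambda_n\sqrt{I^{(1)}(\hat\alpha_n-\alpha^*)} + R(G_{\alpha^*}) + \lambda_n\sqrt{\frac{\log^4 n}{n}}. \tag{7.14}$$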
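
The elementary inequality $\sqrt{|a|}-\sqrt{|b|} \le \sqrt{|a-b|}$ used in this step can be spot-checked numerically. The Python sketch below is only an illustration under arbitrary sampling ranges; the name `sqrt_gap` is hypothetical.

```python
import math
import random

def sqrt_gap(a, b):
    """sqrt|a - b| - (sqrt|a| - sqrt|b|); nonnegative when the inequality holds."""
    return math.sqrt(abs(a - b)) - (math.sqrt(abs(a)) - math.sqrt(abs(b)))

rng = random.Random(0)
pairs = [(rng.uniform(-100, 100), rng.uniform(-100, 100)) for _ in range(10000)]
smallest = min(sqrt_gap(a, b) for a, b in pairs)
print(smallest >= -1e-12)  # allow for floating-point rounding
```

The inequality itself follows from $|a| \le |a-b| + |b|$ together with the subadditivity of the square root.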



Therefore, using a straightforward modification of Lemma 2 (basically replacing
there $I$ by $I^{(1)}$), we obtain



R(G &n ) < 2X n ^c d clN*\\f &n _ a 4 1 + R(G a * ) + A n y 

where iV* = N(a*) = Y,T=i B Y Assumption C and (3.2) 

H/a„-a*Hi<9oQ(G dn AG Q *)- 

We therefore get 



log n 



n 



R(Ga n ) < 2\ n Jc d q (*N*Q(G &n AG a *) + R(G a *) + \ n l 



'log n 



n 
Subtracting $R(G^*)$ from both sides of this inequality, and denoting
$d(G,G^*) = R(G)-R(G^*)$, we obtain



$$d(G_{\hat\alpha_n},G^*) \le 2\lambda_n\sqrt{c_d q_0 c_\psi N^* Q(G_{\hat\alpha_n}\triangle G_{\alpha^*})} + d(G_{\alpha^*},G^*) + \lambda_n\sqrt{\frac{\log^4 n}{n}}. \tag{7.15}$$

But then, by the triangle inequality and $\sqrt{a+b} \le \sqrt{a}+\sqrt{b}$, $a, b \ge 0$, we get



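$$\begin{aligned}
d(G_{\hat\alpha_n},G^*) &\le 2\lambda_n\sqrt{c_d q_0 c_\psi N^*}\,\Bigl[\sqrt{Q(G_{\hat\alpha_n}\triangle G^*)} + \sqrt{Q(G_{\alpha^*}\triangle G^*)}\Bigr] + d(G_{\alpha^*},G^*) + \lambda_n\sqrt{\frac{\log^4 n}{n}} \\
&\le 2\lambda_n\sqrt{c_d q_0 c_\psi \sigma_0^{1/\kappa} N^*}\,\Bigl[d^{1/(2\kappa)}(G_{\hat\alpha_n},G^*) + d^{1/(2\kappa)}(G_{\alpha^*},G^*)\Bigr] + d(G_{\alpha^*},G^*) + \lambda_n\sqrt{\frac{\log^4 n}{n}},
\end{aligned}$$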



where in the last inequality we invoked Assumption A. Now we apply Lemma
5 with, respectively, $t = d(G_{\hat\alpha_n},G^*)$ and $t = d(G_{\alpha^*},G^*)$, to get

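$$\begin{aligned}
d(G_{\hat\alpha_n},G^*) \le{}& \frac{\delta}{2}\bigl[d(G_{\hat\alpha_n},G^*) + d(G_{\alpha^*},G^*)\bigr] \\
&+ 2c_\kappa\,\delta^{-1/(2\kappa-1)}\bigl(4c_d q_0 c_\psi \sigma_0^{1/\kappa}\lambda_n^2 N^*\bigr)^{\kappa/(2\kappa-1)} \\
&+ d(G_{\alpha^*},G^*) + \lambda_n\sqrt{\frac{\log^4 n}{n}},
\end{aligned}$$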

which, together with the inequalities $(1+\delta/2)/(1-\delta/2) \le (1+\delta)^2$ and
$1/(1-\delta/2) \le 2$, which are valid for $\delta \in (0,1]$, implies that on the event $\Xi$
we have
$$R(G_{\hat\alpha_n})-R(G^*) \le (1+\delta)^2\bigl\{R(G_{\alpha^*})-R(G^*) + \delta^{-1/(2\kappa-1)}V_n(N(\alpha^*))\bigr\} + 2\lambda_n\sqrt{\frac{\log^4 n}{n}}.$$
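
The two elementary inequalities in $\delta$ invoked here, $(1+\delta/2)/(1-\delta/2) \le (1+\delta)^2$ and $1/(1-\delta/2) \le 2$ for $\delta \in (0,1]$, can be spot-checked on a grid. This Python sketch is only an illustration; the grid resolution and the name `ratio` are arbitrary choices.

```python
def ratio(delta):
    """(1 + delta/2) / (1 - delta/2), the factor obtained after rearranging."""
    return (1 + delta / 2) / (1 - delta / 2)

# Grid over (0, 1], including the endpoint delta = 1.
grid = [k / 1000 for k in range(1, 1001)]
ok_first = all(ratio(d) <= (1 + d) ** 2 for d in grid)
ok_second = all(1 / (1 - d / 2) <= 2 for d in grid)
print(ok_first, ok_second)
```

The second inequality is tight at $\delta = 1$, which explains the factor 2 in front of the last term of the bound.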

Hence 

P \R(G &n ) - R(G*) > (1 + 5) 2 {R(G a *) - R(G*) 



+ 5 -l/(2 K ~l)y n{N{a * ))} + 2Xn] 



log 4 n 

n 



< C exp 
C 2 d



Since $\alpha^*$ was chosen arbitrarily this holds in fact for all $\alpha^*$. Because a
distribution function is right continuous, we now have shown that also

J>[R(G &n )-R(G*)>(l + 5) 2 inf {R(G a ) - R(G*) 



+ r l/(2 K -l) K(Ar(a))} + 2V 



'log n 



n 



< C exp 



tylog n 
C 2 d 



□ 



Proof of Theorem 2. For r] eH K , G* e G p , we have 

R(G a ) - R{G*) + V n (N(a)) < <T q$\\f a - + F„(iV(a)), 

so that 



(7.16) 
where 



inf {R(G a ) - R(G1) + V n (N(a))} < a q^N~ K /P + V n (N m ) 

a : m(a)<m 

= z(N m ), 



z{t) = a q^t- K /P + V n (t), t>0. 
Now minimizing $z(t)$ over all $t > 0$ gives



71 



s, p/(2K+p-l) 



log n 



since V n {N) >c (iVn 1 log 4 ( 2/t ^ . Let fh be the smallest integer such that 

It is not difficult to see, using (3.6) and (3.7), that 

N^-t<cp 2d (t + l). 
Inserting Nfh in the right-hand side of (7.16) therefore gives 

inf {R(G a )-R(Cr„) + V n (N(a))}<z(N A )^(-^—) 

a-.m(a)<rh 1 \ 71 J 

Note finally that the constants in Theorem 1 depend only on $d$, $\kappa$, $\sigma_0$, $q_0$
and $c_\psi$, so that the result of Theorem 2 follows easily. □

Remark. When this paper was finished we learned from Vladimir Koltchinskii
that he found another penalized classifier that adaptively attains fast
optimal rates [Koltchinskii (2003)]. His method is different from ours and uses
randomization and local Rademacher complexities.